## record current position
current <- 5
## flip coin to generate proposal
coin <- sample( c(-1,1) , size=1 )  # suppose the coin comes up -1
( proposal <- current + coin )
[1] 4
( prob_move <- proposal/current )
[1] 0.8
( current <- ifelse( runif(1) < prob_move , proposal , current ) )
[1] 4
mcmc – brms
Office hours will be on Tuesdays (10-12) starting next week.
10 islands arranged in a loop. Island 2 is twice as big as island 1, island 3 is three times as big as island 1, and so on up to island 10.
num_weeks <- 1e5
positions <- rep(0,num_weeks)
current <- 10
for ( i in 1:num_weeks ) {
## record current position
positions[i] <- current
## flip coin to generate proposal
proposal <- current + sample( c(-1,1) , size=1 )
## now make sure he loops around the archipelago
if ( proposal < 1 ) proposal <- 10
if ( proposal > 10 ) proposal <- 1
## move?
prob_move <- proposal/current
current <- ifelse( runif(1) < prob_move , proposal , current )
}

Gibbs sampling works by drawing samples from the conditional distribution of each parameter given the current values of all the other parameters. The process updates one parameter at a time, eventually converging to the joint posterior distribution. Gibbs sampling is particularly effective for high-dimensional problems where direct sampling is difficult, and it's computationally efficient because it only requires conditional distributions rather than the full joint distribution.
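The Gibbs idea is easiest to see in a toy case (a sketch I'm adding here, not part of the island example): for a bivariate standard normal with correlation rho, each full conditional is itself a normal, so we can just alternate draws.

```r
# Minimal Gibbs sampler for a bivariate standard normal with correlation rho.
# Each full conditional is Normal(rho * other, sqrt(1 - rho^2)).
set.seed(1)
rho <- 0.8
n_samples <- 1e4
x <- numeric(n_samples)
y <- numeric(n_samples)
for (i in 2:n_samples) {
  # draw x given the current y, then y given the new x
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
cor(x, y)  # should land close to rho
```

Notice there is no accept/reject step and no proposal to tune, unlike Metropolis: every conditional draw is accepted.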
However, both Metropolis and Gibbs sampling are inefficient as models become more complex. For that reason, we’ll skip right ahead to Hamiltonian Monte Carlo sampling.
During its warmup period, Stan figures out what the number of leapfrog steps and the step size should be, using an algorithm called the No-U-Turn Sampler (NUTS).
Let’s return to the height and weight data.
data(Howell1, package = "rethinking")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "ft")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
library(psych)
describe(d, fast = T)

       vars   n  mean    sd median  min    max range  skew kurtosis   se
height    1 544  4.54  0.91   4.88 1.77   5.88   4.1 -1.26     0.58 0.04
weight    2 544 78.51 32.45  88.31 9.37 138.87 129.5 -0.54    -0.94 1.39
age       3 544 29.34 20.75  27.00 0.00  88.00  88.0  0.49    -0.56 0.89
male      4 544  0.47  0.50   0.00 0.00   1.00   1.0  0.11    -1.99 0.02
\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta (h_i - \bar{h}) \\ \alpha &\sim \text{Normal}(130, 20) \\ \beta &\sim \text{Normal}(0, 25) \\ \sigma &\sim \text{Uniform}(0, 50) \\ \end{align*}\]
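Before fitting, it can help to see what these priors imply. A minimal base-R prior predictive sketch (the variable names here are my own) draws alpha, beta, and sigma from the priors above and plots the regression lines they imply:

```r
# Prior predictive check: draw parameters from the priors above and
# plot the implied relationship between centered height and weight (lbs).
set.seed(3)
n_sim <- 100
alpha <- rnorm(n_sim, 130, 20)   # alpha ~ Normal(130, 20)
beta  <- rnorm(n_sim, 0, 25)     # beta  ~ Normal(0, 25)
sigma <- runif(n_sim, 0, 50)     # sigma ~ Uniform(0, 50)
h_c   <- seq(-1.5, 1.5, length.out = 50)  # centered heights (ft)
plot(NULL, xlim = range(h_c), ylim = c(0, 300),
     xlab = "centered height (ft)", ylab = "weight (lbs)")
for (i in 1:n_sim) lines(h_c, alpha[i] + beta[i] * h_c, col = rgb(0, 0, 0, 0.2))
```

The beta prior is deliberately flat here, so some of the implied lines will look wild; that is exactly the kind of thing a prior predictive check reveals.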
brm() is the core function for fitting Bayesian models with brms. It translates the model above into a Stan model, which in turn defines the sampler and does the hard work, running in the background on your computer. There are many programs that can be used to run Stan, and you can even run Stan directly if you want.
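You can inspect this translation without fitting anything: brms::make_stancode() returns the generated Stan program. A sketch, assuming d already contains a mean-centered height_c column:

```r
library(brms)
# Print the Stan program brms would compile for this model, without sampling
make_stancode(
  weight ~ 1 + height_c,
  data   = d,
  family = gaussian,
  prior  = c(prior(normal(130, 20), class = Intercept),
             prior(normal(0, 25),   class = b),
             prior(uniform(0, 50),  class = sigma, ub = 50))
)
```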
// generated with brms 2.22.0
functions {
}
data {
int<lower=1> N; // total number of observations
vector[N] Y; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
int<lower=1> Kc; // number of population-level effects after centering
int prior_only; // should the likelihood be ignored?
}
transformed data {
matrix[N, Kc] Xc; // centered version of X without an intercept
vector[Kc] means_X; // column means of X before centering
for (i in 2:K) {
means_X[i - 1] = mean(X[, i]);
Xc[, i - 1] = X[, i] - means_X[i - 1];
}
}
parameters {
vector[Kc] b; // regression coefficients
real Intercept; // temporary intercept for centered predictors
real<lower=0,upper=50> sigma; // dispersion parameter
}
transformed parameters {
real lprior = 0; // prior contributions to the log posterior
lprior += normal_lpdf(b | 0, 25);
lprior += normal_lpdf(Intercept | 130, 20);
lprior += uniform_lpdf(sigma | 0, 50)
- 1 * log_diff_exp(uniform_lcdf(50 | 0, 50), uniform_lcdf(0 | 0, 50));
}
model {
// likelihood including constants
if (!prior_only) {
target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
}
// priors including constants
target += lprior;
}
generated quantities {
// actual population-level intercept
real b_Intercept = Intercept - dot_product(means_X, b);
}
family specifies the distribution of the outcome. In many examples, we'll use a gaussian (normal) distribution, but there are many, many other options.
The formula argument is what you would expect from the lm() and lmer() functions you have seen in the past. The benefit of brms is that this formula can easily handle complex and non-linear terms. We'll play with this more in future classes.
Here we set our priors. Class b refers to slope parameters or beta coefficients. Again, this argument can become very detailed, specific, and flexible, and we'll play more with it.
Hamiltonian MCMC runs for a set number of iterations, throws away the first portion (the warmup), and does all of that multiple times (the number of chains).
warmup defaults to iter/2, but you can set it to something else. Remember, the warmup is not simply a burn-in period. It’s used to figure out the appropriate leapfrog and step size settings. So it’s worth allowing this to be large-ish. Also note that the warmup period will generally run more slowly than the sampling period.
In general, it's recommended that you use one short chain to debug, then four longer chains for estimation, verification, and inference.
Remember, these are random walks through parameter space, so set a seed for reproducibility. Also, these can take a while to run, especially when you are developing more complex models. If you specify a file, the output of the model will automatically be saved. Even better, the next time you run this code, R will check for that file and load it into your workspace instead of re-running the model. (Just be sure to delete the model file if you make changes to any other part of the code.)
# restrict to adults (the summary reports 352 observations) and center height
d <- d[d$age >= 18, ]
d$height_c <- d$height - mean(d$height)

m1 <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
  ),
  iter = 5000, warmup = 1000, chains = 4,
  cores = 4,
  seed = 3,
  file = here("files/data/generated_data/m1"))

You can parallelize your chains (run them in parallel instead of in sequence) by defining how many cores you want to use. Your computer (probably) has at least 4 cores, so 4 is a good choice here. But be warned that your computer may slow down elsewhere while the code is running, and may even run a bit hot.
Family: gaussian
Links: mu = identity; sigma = identity
Formula: weight ~ 1 + height_c
Data: d (Number of observations: 352)
Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
total post-warmup draws = 16000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 99.21 0.50 98.23 100.18 1.00 14807 12164
height_c 42.05 1.95 38.22 45.91 1.00 15921 12552
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 9.38 0.36 8.72 10.12 1.00 15626 12366
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
To evaluate your model, the first question to ask is “did your chains mix?”
This is not \(N\), the number of participants in your sample; it is the number of draws from the posterior. The effective number is an estimate of the number of independent samples from the posterior. Markov chains are typically autocorrelated, so sequential samples are not entirely independent. This happens when chains explore the posterior slowly, as in a Metropolis algorithm, and autocorrelation reduces the effective number of samples. Think of effective sample size as the length of a Markov chain with no autocorrelation that would give an estimate of the same quality as yours.
brms gives you two estimates for effective sample size: bulk and tail.
How many do you need? It depends on how skewed your posterior is.
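To build intuition for why autocorrelation shrinks the effective sample size, here is a small base-R illustration (my own sketch, not part of the lecture code): an AR(1) chain and a crude ESS estimate based on its autocorrelations.

```r
# A deliberately autocorrelated "chain": an AR(1) process with coefficient phi
set.seed(3)
n   <- 1e4
phi <- 0.9
x   <- numeric(n)
for (i in 2:n) x[i] <- phi * x[i - 1] + rnorm(1)
# Crude ESS: n / (1 + 2 * sum of positive-lag autocorrelations)
rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]  # drop lag 0
ess <- n / (1 + 2 * sum(rho[rho > 0]))
ess  # far smaller than n = 10000
```

The stickier the chain (phi closer to 1), the smaller the ESS, even though the raw number of draws is unchanged.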
The Gelman-Rubin convergence metric \((\hat{R})\) is a comparison of between-chain variance to within-chain variance. You want this number to be as close to 1 as you can.
Note that a value of 1 doesn’t mean that you’re fine. It’s a sign of danger, but not a sign of safety.
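For intuition, here is a hand-rolled version of the classic \(\hat{R}\) formula (my own sketch; Stan's actual version uses split chains and rank-normalization), applied to four well-mixed chains and to four chains where one is stuck in the wrong place.

```r
# Classic R-hat: compare between-chain and within-chain variance
rhat <- function(chains) {            # chains: matrix, one column per chain
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # within-chain variance
  B <- n * var(colMeans(chains))      # between-chain variance
  sqrt(((n - 1) * W / n + B / n) / W)
}
set.seed(3)
mixed   <- matrix(rnorm(4000), ncol = 4)             # 4 chains, same target
unmixed <- mixed + rep(c(0, 0, 0, 5), each = 1000)   # chain 4 stuck elsewhere
rhat(mixed)    # close to 1
rhat(unmixed)  # well above 1
```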
A rule in physics is that energy must be conserved. Remember that HMC is literally a physics simulation. Energy won’t be conserved in the system if your skateboarder zings out of the bowl and into outer space. That’s a divergent transition.
Two primary ways to handle divergent transitions are by increasing the adapt_delta parameter and reparameterization.
adapt_delta specifies the target average acceptance probability during the adaptation phase of sampling (warmup). The higher the delta, the smaller the step size. This makes the model more conservative, slower, and also much less likely to encounter problems. Its default is .8. Try increasing it to .9 or .99 if you have divergent transitions.
adapt_delta is like adjusting how smooth and controlled the skateboarder's ride should be.

* A higher adapt_delta (e.g., 0.99) means the skateboarder is super cautious, preferring smooth, stable rides (i.e., high acceptance rates). This leads to smaller step sizes and more leapfrog steps per iteration.
* A lower adapt_delta (e.g., 0.8) lets the skateboarder be riskier, zooming around the bowl faster, but at a higher risk of crashing (divergent transitions).
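In brms, adapt_delta is set through the control argument. A sketch, assuming the fitted model m1 from above:

```r
# Refit with a more cautious sampler: higher target acceptance means
# smaller step sizes and (hopefully) no divergent transitions, at the
# cost of slower sampling. The default adapt_delta is 0.8.
m1_careful <- update(m1, control = list(adapt_delta = 0.99))
```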